fix: send FinishedHeight using safe height when finalized unavailable#72
init4samwise wants to merge 1 commit into main
Conversation
When the host chain doesn't report finalized blocks (beacon client not synced, checkpoint sync incomplete, or testnet without proper finality reporting), the WAL grows unbounded because FinishedHeight is never sent. This change adds a fallback to use the safe block height for FinishedHeight when finalized blocks aren't available. Safe blocks are unlikely to reorg, making them a reasonable threshold for clearing the WAL. Closes ENG-1830
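The fallback described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual implementation: `PairedHeights` mirrors the struct named in the PR, but `pick_wal_clear_height` is a hypothetical helper invented here to show the selection logic.

```rust
/// Mirrors the `PairedHeights` struct referenced in the PR: a host-chain
/// height paired with the corresponding rollup height.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PairedHeights {
    host: u64,
    rollup: u64,
}

/// Hypothetical helper: returns the host height up to which the WAL may be
/// cleared. Uses the finalized height when the host reports one; otherwise
/// falls back to the safe height (safe blocks are unlikely to reorg).
fn pick_wal_clear_height(finalized: PairedHeights, safe: PairedHeights) -> u64 {
    if finalized.host != 0 {
        finalized.host
    } else {
        safe.host
    }
}

fn main() {
    // Host reports no finalized blocks: fall back to the safe height.
    let finalized = PairedHeights { host: 0, rollup: 0 };
    let safe = PairedHeights { host: 120, rollup: 90 };
    assert_eq!(pick_wal_clear_height(finalized, safe), 120);

    // Finalized blocks available: use the finalized height as before.
    let finalized = PairedHeights { host: 100, rollup: 80 };
    assert_eq!(pick_wal_clear_height(finalized, safe), 100);
}
```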
```diff
-let PairedHeights { host: _, rollup: safe_ru_height } =
-    self.load_safe_block_heights(ru_height)?;
+let safe_heights = self.load_safe_block_heights(ru_height)?;
+let safe_ru_height = safe_heights.rollup;
```
Yes, this is just refactoring to capture the full `PairedHeights` struct so we can use `safe_heights.host` in the fallback below. Happy to revert to the destructuring style if preferred — I can rename it to make the intent clearer:
```rust
let safe_heights @ PairedHeights { rollup: safe_ru_height, .. } =
    self.load_safe_block_heights(ru_height)?;
```

Or keep the current approach. Let me know your preference.
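For reference, the `@` binding suggested above captures the whole value while also destructuring one field, so both names are usable afterwards. A standalone illustration (with a stand-in `PairedHeights`, not the PR's actual types):

```rust
// Stand-in for the struct in the PR; `Copy` keeps the `@` pattern simple.
#[derive(Debug, Clone, Copy)]
struct PairedHeights {
    host: u64,
    rollup: u64,
}

fn main() {
    let result: Result<PairedHeights, ()> = Ok(PairedHeights { host: 42, rollup: 17 });

    // One pattern binds the whole value as `safe_heights` AND the `rollup`
    // field as `safe_ru_height`.
    let safe_heights @ PairedHeights { rollup: safe_ru_height, .. } =
        result.expect("heights available");

    assert_eq!(safe_ru_height, 17);
    assert_eq!(safe_heights.host, 42);
}
```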
```rust
// and rollup finalized heights are both 0. Otherwise, we load the finalized
// RU header and set the finalized block hash.
let finalized_ru_block_hash =
    self.set_finalized_ru_block_hash(finalized_heights, genesis_ru_hash)?;
```
update the rustdoc for `set_finalized_ru_block_hash` to be more informative.
Good point. I'll update this in the next commit. The current doc doesn't capture the full picture — it should also document:
- When it returns genesis hash (both heights = 0)
- When it loads and sets the actual finalized header
- The relationship with the new safe-height fallback
Will push an update.
```rust
// Fallback: use safe height when finalized isn't available.
// Safe blocks are unlikely to reorg, so this is a reasonable
// threshold for clearing the WAL.
self.update_highest_processed_height(safe_heights.host)?;
```
in what cases will this fail? will failures cause difficult-to-recover issues?
Good question. Looking at `update_highest_processed_height`:

Failure cases:
1. `sealed_header(safe_heights.host)?` — fails if the DB read errors. Returns `Err`, which propagates up.
2. `.expect("db inconsistent...")` — panics if the header doesn't exist. This shouldn't happen, since `load_safe_block_heights` already validated these heights exist, but it's a hard crash if the DB is corrupted.
3. `update_exex_head(...)` → `self.host.events.send(...)` — fails if the reth node has shut down (channel closed).
Recovery:
- DB errors (1): Node crashes, restarts, resumes from last consistent state. No data loss.
- DB inconsistency (2): Indicates deeper corruption — needs investigation. Same as existing behavior.
- Channel closed (3): Node is shutting down anyway, graceful exit.
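The three failure modes above can be sketched in isolation. Everything here is a hypothetical stand-in (not the real reth/ExEx APIs): `sealed_header` and `update_highest_processed_height` only model the shapes of the failures discussed, with an `mpsc` channel standing in for the node's event channel.

```rust
use std::sync::mpsc;

#[derive(Debug)]
enum DbError {
    ReadFailed,
}

// Hypothetical stand-in for `sealed_header`: Ok(Some(header)) on success,
// Ok(None) if the header is unexpectedly absent, Err on a DB read failure.
fn sealed_header(height: u64) -> Result<Option<String>, DbError> {
    match height {
        0 => Err(DbError::ReadFailed), // mode 1: DB read error
        h if h < 100 => Ok(None),      // mode 2: inconsistent DB
        h => Ok(Some(format!("header@{h}"))),
    }
}

fn update_highest_processed_height(
    height: u64,
    events: &mpsc::Sender<String>,
) -> Result<(), DbError> {
    // Mode 1: `?` propagates the DB error to the caller; the node can
    // restart and resume from its last consistent state.
    let header = sealed_header(height)?
        // Mode 2: a hard crash if the header is missing despite having
        // been validated — indicates deeper corruption.
        .expect("db inconsistent: header missing for validated height");
    // Mode 3: sending fails only if the receiver (the node) has shut down,
    // in which case we're exiting anyway.
    let _ = events.send(header);
    Ok(())
}

fn main() {
    let (tx, rx) = mpsc::channel();
    assert!(update_highest_processed_height(0, &tx).is_err()); // DB error
    assert!(update_highest_processed_height(150, &tx).is_ok()); // happy path
    assert_eq!(rx.recv().unwrap(), "header@150");
}
```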
Is this safe for the fallback path?
The same failure modes exist on the finalized path. The only difference is when we call it — using the safe vs the finalized height. If `safe_heights` is valid (which `load_safe_block_heights` ensures), the call should succeed.
However, I see now from the Slack thread that this might be unrelated to the mainnet issue you're seeing. Should we hold off on this PR until the root cause is clearer?
i mean in what cases will using safe height to clear the WAL result in issues
i.e. what is the behavior when Ethereum has a deep reorg? How can a signet node recover from that?
what other edge cases have you considered?
You're right to push on this — I hadn't fully thought through the reorg implications.
Deep reorg scenario:
If we clear WAL up to safe height X, and then Ethereum reorgs past X:
- reth sends a revert notification for the reorged blocks
- `on_host_revert` tries to unwind the RU chain
- But reth may have already pruned the WAL entries we cleared — we can't replay from the ExEx side
Why this is problematic:
Safe blocks can reorg (unlike finalized blocks). A safe-block reorg requires >1/3 malicious validators, but it's not cryptographically impossible. If it happens, we'd be left in an inconsistent state.
Edge cases I missed:
- Beacon client temporarily reports stale safe/finalized (network issues)
- Checkpoint sync completing mid-operation
- Safe head progressing faster than we process during backfill
Conclusion:
You're right — using safe height for WAL clearing is not safe. The original code is correct: only clear WAL for truly finalized blocks.
The WAL warning on testnets without finality is annoying but not critical — it's a space issue, not a correctness issue. Better to live with disk growth than risk unrecoverable state.
Should I close this PR?
seems to be a red herring. if we see this behavior in testnets we can revisit
Summary
When the host chain doesn't report finalized blocks (beacon client not synced, checkpoint sync incomplete, or testnet without proper finality reporting), the WAL grows unbounded because `FinishedHeight` is never sent to reth. This was observed on Pecorino with the warning:
Root Cause
In `update_canon_heights()`, `FinishedHeight` is only sent when `finalized_ru_block_hash != genesis_ru_hash`. When the host doesn't report finalized blocks, `load_finalized_block_heights()` returns `{ host: 0, rollup: 0 }`, and `FinishedHeight` is never sent.

Fix
Add a fallback to use the safe block height for `FinishedHeight` when finalized blocks aren't available. Safe blocks are unlikely to reorg, making them a reasonable threshold for clearing the WAL. If there's a reorg beyond the safe block, reth will replay from the WAL as expected.
Closes ENG-1830